Introduction

Many colleges want to maximize the donations they receive from their alumni. To do so, they need to predict the salaries and unemployment rates of recent graduates based on their field of study and other factors. With those predictions, colleges can direct more funding into the programs that produce the largest return on their investment in students.

Business Question:

Where should colleges invest in order to maximize the donations they receive back from recent graduates?

Analysis Question:

Based on the characteristics and education of recent graduates, what is their predicted median salary? In particular, will they earn more than $50,000?

Background Information

The data are pulled from the 2010-12 American Community Survey Public Use Microdata Series and are limited to respondents under the age of 28. The code and data follow FiveThirtyEight's story "The Economic Guide to Picking a College Major."

Process Overview

The raw data are first cleaned and recoded, then explored numerically and visually. Two classifiers are then built and tuned, a C5.0 decision tree and a random forest, and each is evaluated on held-out data using confusion matrices and variable-importance measures.

Data Cleaning

A brief look at the raw data can be found below.

## 'data.frame':    172 obs. of  21 variables:
##  $ Rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Major_code          : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
##  $ Major               : chr  "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
##  $ Total               : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ Men                 : int  2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
##  $ Women               : int  282 77 131 135 11021 373 1667 960 10907 16016 ...
##  $ Major_category      : chr  "Engineering" "Engineering" "Engineering" "Engineering" ...
##  $ ShareWomen          : num  0.121 0.102 0.153 0.107 0.342 ...
##  $ Sample_size         : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ Employed            : int  1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
##  $ Full_time           : int  1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
##  $ Part_time           : int  270 170 133 150 5180 264 296 553 13101 12695 ...
##  $ Full_time_year_round: int  1207 388 340 692 16697 1449 2482 827 54639 41413 ...
##  $ Unemployed          : int  37 85 16 40 1672 400 308 33 4650 3895 ...
##  $ Unemployment_rate   : num  0.0184 0.1172 0.0241 0.0501 0.0611 ...
##  $ Median              : int  110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
##  $ P25th               : int  95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
##  $ P75th               : int  125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
##  $ College_jobs        : int  1534 350 456 529 18314 1142 1768 972 52844 45829 ...
##  $ Non_college_jobs    : int  364 257 176 102 4440 657 314 500 16384 10874 ...
##  $ Low_wage_jobs       : int  193 50 0 0 972 244 259 220 3253 3170 ...
##  - attr(*, "na.action")= 'omit' Named int 22
##   ..- attr(*, "names")= chr "22"

As can be seen above, most columns are integer counts. Several of these variables can be converted into factors alongside the numeric ones. In addition, the variables Rank, Major_code, and Major can be dropped: Rank is highly correlated with the salary variable (the data are ranked by median salary), and the other two are too specific to generalize from.

# Add two binary targets, then drop Rank, Major_code, and Major (columns 1-3).
# Note: the 0.5 threshold means a 50% unemployment rate; no major exceeds it,
# so every row ends up labeled "Low".
majors_added_categorical <- majors_raw %>%
  mutate(Over.50K = ifelse(Median > 50000, "Over", "Under.Equal"),
         High.Unemployment = ifelse(Unemployment_rate > 0.5, "High", "Low")) %>%
  select(-1, -2, -3)

In addition, the many levels of the Major_category variable can be collapsed into a few broader groups, making the variable more useful for the analysis.

## 
## Sciences     Arts    Other     STEM 
##       54       30       48       40
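A minimal sketch of how the detailed major categories might be collapsed into the four broad groups tallied above, assuming forcats' fct_collapse; the exact level-to-group mapping is an assumption and is not shown in the output:

```r
library(dplyr)
library(forcats)

# Collapse the detailed Major_category levels into four broad groups.
# The mapping below is illustrative; adjust to match the intended grouping.
majors_factors <- majors_added_categorical %>%
  mutate(Major_category = fct_collapse(factor(Major_category),
    STEM     = c("Engineering", "Computers & Mathematics", "Physical Sciences"),
    Sciences = c("Biology & Life Science", "Health",
                 "Agriculture & Natural Resources"),
    Arts     = c("Arts", "Humanities & Liberal Arts"),
    other_level = "Other"))

table(majors_factors$Major_category)
```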

For parts of the analysis, all categorical variables need to be one-hot encoded, which is done below:

# One Hot Encoded Data
majors_onehot <- one_hot(data.table(majors_factors), cols = c("Major_category", "High.Unemployment"))
# Normal Data
majors <- majors_factors

Exploratory Data Analysis

Before beginning the analytical part of the exploration, it is worth summarizing and visualizing the data to understand it as a whole, with an emphasis on the variables expected to matter for the analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   33000   36000   40077   45000  110000
##                 Total        Men     Women ShareWomen Sample_size  Employed
## Total       1.0000000  0.8780884 0.9447645  0.1429993   0.9455747 0.9962140
## Men         0.8780884  1.0000000 0.6727589 -0.1120136   0.8751756 0.8706047
## Women       0.9447645  0.6727589 1.0000000  0.2978321   0.8626064 0.9440365
## ShareWomen  0.1429993 -0.1120136 0.2978321  1.0000000   0.0974957 0.1475468
## Sample_size 0.9455747  0.8751756 0.8626064  0.0974957   1.0000000 0.9644062
##             Full_time Part_time Full_time_year_round Unemployed
## Total       0.9893392 0.9502684            0.9811118  0.9747684
## Men         0.8935631 0.7515917            0.8924540  0.8694115
## Women       0.9176812 0.9545133            0.9057195  0.9116943
## ShareWomen  0.1202001 0.2122898            0.1125230  0.1212430
## Sample_size 0.9783624 0.8245444            0.9852125  0.9179335
##             Unemployment_rate     Median       P25th       P75th College_jobs
## Total              0.08319170 -0.1067377 -0.07192608 -0.08319767    0.8004648
## Men                0.10150234  0.0259906  0.03872518  0.05239290    0.5631684
## Women              0.05910776 -0.1828419 -0.13773826 -0.16452834    0.8519460
## ShareWomen         0.07320458 -0.6186898 -0.50019863 -0.58693216    0.1955501
## Sample_size        0.06295494 -0.0644750 -0.02442859 -0.05225614    0.7012309
##             Non_college_jobs Low_wage_jobs
## Total              0.9412471     0.9355096
## Men                0.8514998     0.7913360
## Women              0.8721318     0.9044699
## ShareWomen         0.1370066     0.1878496
## Sample_size        0.9153352     0.8601159
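The summary and correlation matrix above can be produced along these lines (a sketch; the data frame name `majors` follows the earlier chunk, and column names follow the str() output):

```r
# Five-number summary (plus mean) of median salaries across majors
summary(majors$Median)

# Pairwise correlations between all numeric columns
cor(majors[sapply(majors, is.numeric)])
```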

Data Visualization

Model Building Classification Decision Tree (C5.0)

## [1] 172  22
## [1] 121  22
## [1] 26 22
## [1] 25 22
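The dimensions above reflect a roughly 70/15/15 split of the 172 majors into training, test, and validation sets. A sketch of one way to produce such a split (the seed and exact method are assumptions):

```r
set.seed(1)
n <- nrow(majors_onehot)
idx <- sample(seq_len(n))  # shuffle row indices

# Roughly 70/15/15 train/test/validation split
majors_train <- majors_onehot[idx[1:121], ]
majors_test  <- majors_onehot[idx[122:147], ]
majors_valid <- majors_onehot[idx[148:172], ]

dim(majors_onehot); dim(majors_train); dim(majors_test); dim(majors_valid)
```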
## Classes 'data.table' and 'data.frame':   121 obs. of  21 variables:
##  $ Total                  : int  2339 756 856 2573 1792 81527 41542 14955 4321 8925 ...
##  $ Men                    : int  2057 679 725 2200 832 65511 33258 8407 3526 6062 ...
##  $ Women                  : int  282 77 131 373 960 16016 8284 6548 795 2863 ...
##  $ Major_category_Sciences: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Major_category_Arts    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Major_category_Other   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Major_category_STEM    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ ShareWomen             : num  0.121 0.102 0.153 0.145 0.536 ...
##  $ Sample_size            : int  36 7 3 17 10 631 399 79 30 55 ...
##  $ Employed               : int  1976 640 648 1857 1526 61928 32506 10047 3608 6170 ...
##  $ Full_time              : int  1849 556 558 2038 1085 55450 30315 9017 2999 5455 ...
##  $ Part_time              : int  270 170 133 264 553 12695 5146 2694 811 1983 ...
##  $ Full_time_year_round   : int  1207 388 340 1449 827 41413 23621 5986 2004 3413 ...
##  $ Unemployed             : int  37 85 16 400 33 3895 2275 1019 23 589 ...
##  $ Unemployment_rate      : num  0.0184 0.1172 0.0241 0.1772 0.0212 ...
##  $ P25th                  : int  95000 55000 50000 50000 31500 45000 45000 36000 25000 40000 ...
##  $ P75th                  : int  125000 90000 105000 102000 109000 72000 75000 70000 74000 76000 ...
##  $ College_jobs           : int  1534 350 456 1142 972 45829 23694 6439 2439 3603 ...
##  $ Non_college_jobs       : int  364 257 176 657 500 10874 5721 2471 947 1595 ...
##  $ Low_wage_jobs          : int  193 50 0 244 220 3170 980 789 263 524 ...
##  $ High.Unemployment_Low  : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
## C5.0 
## 
## 121 samples
##  21 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.9063986  0.5958991
##   rules  FALSE   10      0.9128089  0.6231278
##   rules  FALSE   20      0.9143473  0.6326592
##   rules   TRUE    1      0.9234149  0.6373265
##   rules   TRUE   10      0.9200816  0.6287551
##   rules   TRUE   20      0.9184149  0.6212551
##   tree   FALSE    1      0.9047319  0.6025318
##   tree   FALSE   10      0.9128089  0.6283924
##   tree   FALSE   20      0.9140909  0.6399963
##   tree    TRUE    1      0.9217483  0.6352432
##   tree    TRUE   10      0.9147786  0.6193285
##   tree    TRUE   20      0.9147786  0.6193285
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
##  = TRUE.
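The training printout above matches a caret grid search over the C5.0 tuning parameters with repeated 10-fold cross-validation. A sketch, assuming the training frame is called `majors_train` with target `Over.50K`:

```r
library(caret)
library(C50)

set.seed(1)
# 10-fold cross-validation, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Tune over model type (rules vs. tree), winnowing, and boosting trials
grid <- expand.grid(model  = c("rules", "tree"),
                    winnow = c(FALSE, TRUE),
                    trials = c(1, 10, 20))

c50_fit <- train(Over.50K ~ ., data = majors_train,
                 method = "C5.0", trControl = ctrl, tuneGrid = grid)
c50_fit
```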

Prediction

## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    Over Under.Equal
##   Over           3           0
##   Under.Equal    1          22
##                                          
##                Accuracy : 0.9615         
##                  95% CI : (0.8036, 0.999)
##     No Information Rate : 0.8462         
##     P-Value [Acc > NIR] : 0.07441        
##                                          
##                   Kappa : 0.8354         
##                                          
##  Mcnemar's Test P-Value : 1.00000        
##                                          
##             Sensitivity : 0.7500         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9565         
##              Prevalence : 0.1538         
##          Detection Rate : 0.1154         
##    Detection Prevalence : 0.1154         
##       Balanced Accuracy : 0.8750         
##                                          
##        'Positive' Class : Over           
## 
# Given values for the other variables, predict the median-salary class
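The confusion matrix above comes from scoring held-out data with the fitted model. A sketch, assuming the fitted model is `c50_fit` and the test frame is `majors_test`:

```r
library(caret)

# Predict the held-out test set and compare predictions to actual labels,
# treating "Over" ($50K+) as the positive class
preds <- predict(c50_fit, newdata = majors_test)
confusionMatrix(preds, majors_test$Over.50K, positive = "Over")
```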

Evaluation

## C5.0 variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                         Overall
## P75th                       100
## Unemployed                    0
## Low_wage_jobs                 0
## ShareWomen                    0
## Unemployment_rate             0
## Sample_size                   0
## High.Unemployment_Low         0
## Major_category_STEM           0
## Full_time_year_round          0
## Total                         0
## Men                           0
## Employed                      0
## Non_college_jobs              0
## Part_time                     0
## Major_category_Arts           0
## P25th                         0
## College_jobs                  0
## Major_category_Sciences       0
## Women                         0
## Major_category_Other          0
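The importance table above is caret's variable-importance method applied to the fitted C5.0 model (a sketch, assuming the model object is `c50_fit`); note that P75th carries all of the importance, which is expected given how closely the 75th-percentile salary tracks the median being predicted:

```r
library(caret)
# C5.0 usage-based variable importance
varImp(c50_fit)
```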
## C5.0 
## 
## 121 samples
##  21 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE   20      0.9143473  0.6326592
##   rules  FALSE   30      0.9142191  0.6272642
##   rules  FALSE   40      0.9158858  0.6379460
##   rules   TRUE   20      0.9184149  0.6212551
##   rules   TRUE   30      0.9184149  0.6212551
##   rules   TRUE   40      0.9184149  0.6212551
##   tree   FALSE   20      0.9140909  0.6399963
##   tree   FALSE   30      0.9158858  0.6404803
##   tree   FALSE   40      0.9158858  0.6404803
##   tree    TRUE   20      0.9147786  0.6193285
##   tree    TRUE   30      0.9147786  0.6193285
##   tree    TRUE   40      0.9147786  0.6193285
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = rules and
##  winnow = TRUE.
## C5.0 
## 
## 121 samples
##  21 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.9063986  0.5958991
##   rules  FALSE   10      0.9128089  0.6231278
##   rules  FALSE   20      0.9143473  0.6326592
##   rules   TRUE    1      0.9234149  0.6373265
##   rules   TRUE   10      0.9200816  0.6287551
##   rules   TRUE   20      0.9184149  0.6212551
##   tree   FALSE    1      0.9047319  0.6025318
##   tree   FALSE   10      0.9128089  0.6283924
##   tree   FALSE   20      0.9140909  0.6399963
##   tree    TRUE    1      0.9217483  0.6352432
##   tree    TRUE   10      0.9147786  0.6193285
##   tree    TRUE   20      0.9147786  0.6193285
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
##  = TRUE.
## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    Over Under.Equal
##   Over           4           0
##   Under.Equal    0          22
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8677, 1)
##     No Information Rate : 0.8462     
##     P-Value [Acc > NIR] : 0.01299    
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.1538     
##          Detection Rate : 0.1538     
##    Detection Prevalence : 0.1538     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Over       
## 
## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    Over Under.Equal
##   Over           2           3
##   Under.Equal    1          19
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.6392, 0.9546)
##     No Information Rate : 0.88            
##     P-Value [Acc > NIR] : 0.8266          
##                                           
##                   Kappa : 0.4118          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.8636          
##          Pos Pred Value : 0.4000          
##          Neg Pred Value : 0.9500          
##              Prevalence : 0.1200          
##          Detection Rate : 0.0800          
##    Detection Prevalence : 0.2000          
##       Balanced Accuracy : 0.7652          
##                                           
##        'Positive' Class : Over            
## 

Model Building Classification Random Forest

## [1] 0.3953488
## 
## LE.EQ.20K     G.50K 
##       104        68
## [1] 121  21
## [1] 25 21
## [1] 26 21
## [1] 4.472136
##    X1.nrow.combined_RF.err.rate.       OOB LE.EQ.20K     G.50K
## 1                              1 0.2982456 0.3666667 0.2222222
## 2                              2 0.2325581 0.2549020 0.2000000
## 3                              3 0.2475248 0.2372881 0.2619048
## 4                              4 0.2110092 0.1904762 0.2391304
## 5                              5 0.2280702 0.1617647 0.3260870
## 6                              6 0.2288136 0.1830986 0.2978723
## 7                              7 0.2000000 0.1506849 0.2765957
## 8                              8 0.2250000 0.1917808 0.2765957
## 9                              9 0.1735537 0.1095890 0.2708333
## 10                            10 0.1900826 0.1232877 0.2916667
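The per-tree error table above can be read off a fitted random forest's `err.rate` matrix, which records the OOB and per-class error after each successive tree. A sketch, assuming the training frame is `combined_RF_train` with target `combined_target`:

```r
library(randomForest)

set.seed(1)
# Fit a small forest so the error trajectory is easy to inspect
combined_RF <- randomForest(combined_target ~ ., data = combined_RF_train,
                            ntree = 10)

# OOB and per-class error rates after trees 1 through 10
data.frame(1:nrow(combined_RF$err.rate), combined_RF$err.rate)
```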
## 'data.frame':    121 obs. of  21 variables:
##  $ Total               : int  2339 756 1258 32260 3777 1792 91227 81527 15058 14955 ...
##  $ Men                 : int  2057 679 1123 21239 2110 832 80320 65511 12953 8407 ...
##  $ Women               : int  282 77 135 11021 1667 960 10907 16016 2105 6548 ...
##  $ Major_category      : Factor w/ 4 levels "Sciences","Arts",..: 4 4 4 4 3 1 4 4 4 4 ...
##  $ ShareWomen          : num  0.121 0.102 0.107 0.342 0.441 ...
##  $ Sample_size         : int  36 7 16 289 51 10 1029 631 147 79 ...
##  $ Employed            : int  1976 640 758 25694 2912 1526 76442 61928 11391 10047 ...
##  $ Full_time           : int  1849 556 1069 23170 2924 1085 71298 55450 11106 9017 ...
##  $ Part_time           : int  270 170 150 5180 296 553 13101 12695 2724 2694 ...
##  $ Full_time_year_round: int  1207 388 692 16697 2482 827 54639 41413 8790 5986 ...
##  $ Unemployed          : int  37 85 40 1672 308 33 4650 3895 794 1019 ...
##  $ Unemployment_rate   : num  0.0184 0.1172 0.0501 0.0611 0.0957 ...
##  $ Median              : int  110000 75000 70000 65000 62000 62000 60000 60000 60000 60000 ...
##  $ P25th               : int  95000 55000 43000 50000 53000 31500 48000 45000 42000 36000 ...
##  $ P75th               : int  125000 90000 80000 75000 72000 109000 70000 72000 70000 70000 ...
##  $ College_jobs        : int  1534 350 529 18314 1768 972 52844 45829 8184 6439 ...
##  $ Non_college_jobs    : int  364 257 102 4440 314 500 16384 10874 2425 2471 ...
##  $ Low_wage_jobs       : int  193 50 0 972 259 220 3253 3170 372 789 ...
##  $ Over.50K            : Factor w/ 2 levels "Over","Under.Equal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ High.Unemployment   : Factor w/ 1 level "Low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ combined_target     : Factor w/ 2 levels "LE.EQ.20K","G.50K": 1 1 1 2 2 2 1 1 1 2 ...
## mtry = 4  OOB error = 20.66% 
## Searching left ...
## mtry = 2     OOB error = 19.01% 
## 0.08 0.05 
## mtry = 1     OOB error = 29.75% 
## -0.5652174 0.05 
## Searching right ...
## mtry = 8     OOB error = 14.88% 
## 0.2173913 0.05 
## mtry = 16    OOB error = 9.09% 
## 0.3888889 0.05 
## mtry = 20    OOB error = 11.57% 
## -0.2727273 0.05

## 
## Call:
##  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1]) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 16
## 
##         OOB estimate of  error rate: 12.4%
## Confusion matrix:
##           LE.EQ.20K G.50K class.error
## LE.EQ.20K        65     8   0.1095890
## G.50K             7    41   0.1458333
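The search printout and the final model call above match randomForest's tuneRF() followed by a refit at the best mtry. A sketch, assuming the same `combined_RF_train` frame (the refit call mirrors the Call shown in the output):

```r
library(randomForest)

# Separate predictors from the target
x <- combined_RF_train[, setdiff(names(combined_RF_train), "combined_target")]
y <- combined_RF_train$combined_target

set.seed(1)
# Search mtry outward from the default (sqrt(p)), doubling/halving each step
# and keeping steps that improve OOB error by at least 5%
res <- tuneRF(x, y, stepFactor = 2, improve = 0.05)

# Refit with the mtry value that achieved the lowest OOB error
best_rf <- randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
best_rf
```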

Tuning

Because the built-in random forest model did not cooperate with caret's standard tuning, a custom random forest tuning routine was written for caret so that the best values of all three hyperparameters of interest (mtry, sampsize, and ntree) could be determined.

Now we can set the hyperparameter values to try when tuning the model.
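The grid of candidate values printed below can be generated with expand.grid(); note that the first argument varies fastest, which matches the row order of the printout:

```r
# Hyperparameter grid for the custom random forest tuner:
# 3 mtry values x 3 sample sizes x 3 forest sizes = 27 combinations
rf_grid <- expand.grid(.mtry     = 3:5,
                       .sampsize = c(50, 100, 200),
                       .ntree    = c(200, 300, 400))
rf_grid
```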

##    .mtry .sampsize .ntree
## 1      3        50    200
## 2      4        50    200
## 3      5        50    200
## 4      3       100    200
## 5      4       100    200
## 6      5       100    200
## 7      3       200    200
## 8      4       200    200
## 9      5       200    200
## 10     3        50    300
## 11     4        50    300
## 12     5        50    300
## 13     3       100    300
## 14     4       100    300
## 15     5       100    300
## 16     3       200    300
## 17     4       200    300
## 18     5       200    300
## 19     3        50    400
## 20     4        50    400
## 21     5        50    400
## 22     3       100    400
## 23     4       100    400
## 24     5       100    400
## 25     3       200    400
## 26     4       200    400
## 27     5       200    400
## 121 samples
##  19 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 97, 97, 97, 96, 97, 97, ... 
## Resampling results across tuning parameters:
## 
##   mtry  sampsize  ntree  ROC        Sens       Spec     
##   3      50       200    0.9903810  0.8533333  1.0000000
##   3      50       300    0.9910159  0.8300000  1.0000000
##   3      50       400    0.9910159  0.8266667  1.0000000
##   3     100       200    0.9913333  0.8500000  1.0000000
##   3     100       300    0.9871429  0.8266667  1.0000000
##   3     100       400    0.9910159  0.8300000  1.0000000
##   3     200       200    0.9910159  0.8300000  1.0000000
##   3     200       300    0.9897460  0.8400000  1.0000000
##   3     200       400    0.9903492  0.8400000  0.9980952
##   4      50       200    0.9913333  0.8633333  1.0000000
##   4      50       300    0.9916508  0.8666667  1.0000000
##   4      50       400    0.9910159  0.8533333  1.0000000
##   4     100       200    0.9909841  0.9000000  1.0000000
##   4     100       300    0.9897143  0.8666667  1.0000000
##   4     100       400    0.9916508  0.8766667  1.0000000
##   4     200       200    0.9906984  0.8766667  1.0000000
##   4     200       300    0.9925873  0.8666667  1.0000000
##   4     200       400    0.9929206  0.8533333  1.0000000
##   5      50       200    0.9916508  0.9233333  1.0000000
##   5      50       300    0.9910159  0.8966667  1.0000000
##   5      50       400    0.9922857  0.8966667  1.0000000
##   5     100       200    0.9903810  0.8966667  1.0000000
##   5     100       300    0.9916508  0.8866667  1.0000000
##   5     100       400    0.9916508  0.9100000  1.0000000
##   5     200       200    0.9922857  0.8866667  1.0000000
##   5     200       300    0.9910159  0.8866667  1.0000000
##   5     200       400    0.9916508  0.8633333  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, ntree = 400 and sampsize
##  = 200.

Evaluation

# Evaluation of Model

Fairness Assessment

Conclusion

What can you say about the results of the methods section as it relates to your question given the limitations to your model?

Future Recommendations

What additional analysis is needed or what limited your analysis on this project?